Abstract

Background: Support vector regression (SVR) and Gaussian process regression (GPR) were used for the analysis of 
electroanalytical experimental data to estimate diffusion coefficients. 

Results: For simulated cyclic voltammograms based on the EC, E_qr, and E_qrC mechanisms, these regression
algorithms in combination with nonlinear kernel/covariance functions yielded diffusion coefficients with higher 
accuracy as compared to the standard approach of calculating diffusion coefficients relying on the Nicholson-Shain 
equation. The level of accuracy achieved by SVR and GPR is virtually independent of the rate constants governing the 
respective reaction steps. Further, the reduction of high-dimensional voltammetric signals by manual selection of 
typical voltammetric peak features decreased the performance of both regression algorithms compared to a 
reduction by downsampling or principal component analysis. After training on simulated data sets, diffusion 
coefficients were estimated by the regression algorithms for experimental data comprising voltammetric signals for 
three organometallic complexes. 

Conclusions: The estimated diffusion coefficients closely matched the values determined by the parameter fitting
method, while the regression approach considerably reduced the required computational time for one of the
reaction mechanisms. The automated processing of voltammograms by the regression algorithms yields better
results than the conventional analysis of peak-related data.

Keywords: Support vector regression, Gaussian process regression, Diffusion coefficient, Principal component 
analysis, Voltammetry, Reaction mechanism 



Background 

Voltammetric signals are measurements of the current 
flowing through an electrode as a function of an externally 
controlled electrode potential. For example, in a simple 
case for an initial oxidation, during a single cycle in cyclic 
voltammetry the electrode potential first increases lin- 
early with time and, upon reaching the switching poten- 
tial, decreases linearly back to the starting potential [1,2]. 
It has been argued that voltammetric techniques have 
found widespread use due to their high sensitivity, ade- 
quate selectivity, and ready availability of instrumentation 
[3]. Measurements of cyclic voltammetric signals provide
detailed information about reactions which include, or
are coupled to, electron transfer steps, and thus enable 
the analysis of the underlying mechanisms [4]. In a spe- 
cial context, these measurements are used, for example, 
to study the release of neurotransmitters [5], and to char- 
acterize the electrochemical properties of recording and 
stimulation microelectrodes in neuroscience research [6]. 

Automated acquisition of experimental data [7,8] and 
computer simulations of electrochemical systems [9,10] 
play an important role in modern electrochemistry. Due 
to the wide applicability and high speed of voltammetric 
experiments [3], data analysis methods are required to aid 
electrochemists in extracting knowledge about electro- 
chemical systems [11-14]. Recently proposed data analysis 
methods include, for example, multi-parameter estimation
from hypersurface models [15,16], artificial neural
networks for classifying voltammetric signals by reac- 
tion mechanism [17], and bootstrap resampling to extract 
system parameters and their error distributions [18]. 

The diffusion coefficient D is an important physical
parameter of the species involved in an electrochemical
reaction that describes diffusional transport. Since
Nicholson and Shain's classical treatment [1], diffusion
coefficients have been extracted directly from voltammetric
signals based on theoretical relations (Randles-Sevcik
equation) that are valid for particular electrode reaction
mechanisms. Recently, analytical solutions for calculating the
diffusion coefficient from flux data have also been proposed
[19,20], but these are restricted to purely diffusive and
diffusive-convective conditions. Semiintegral analysis provides a
"linearization" method that allows D to be determined 
for single electron transfers without kinetic complications 
[21]. As an alternative, fitting of simulated voltammet- 
ric features to experimental data [11,15,16,22], or full 
current/potential curves [23,24] may provide values for 
D. Both approaches have limitations: Theoretical rela- 
tionships are only valid for certain reaction mechanisms 
and kinetic schemes, while the fitting of simulated data 
requires formulation of a reasonable mechanistic hypoth- 
esis, substantial computation time and is very sensitive to 
the initialization of the electrochemical system parame- 
ters [15]. Non-electrochemical approaches to determine 
D include PGSE-NMR spectroscopy [25,26]. However, 
these require expensive instrumentation and considerable 
additional expertise. 

To overcome such limitations, we investigate the esti- 
mation of diffusion coefficients from experimental cyclic 
voltammograms by means of two function estimation 
techniques, support vector regression (SVR) and Gaus- 
sian process regression (GPR) [27,28]. Support vector 
machines, as a tool for both regression and classification, 
have recently gained popularity across different applica- 
tion fields such as genetics [29], neuroscience [30,31], 
quantum chemistry [32], spectroscopy [33-35], and elec- 
trochemistry [36]. Similar to support vector machines, 
Gaussian processes have lately seen a revival of interest 
due to their combination with covariance kernels [28] and 
were successfully applied to problems in (bio)chemistry 
and robotics concerning micro-array analysis [37], and 
decoding of spike trains [38]. 

Methods 

In the following, $f$ will denote a scalar function
mapping vectors $x \in \mathbb{R}^n$ to a scalar $y \in \mathbb{R}$. The
estimation of diffusion coefficients from voltammetric
signals is then equivalent to estimating the unknown
function $f(x) \mapsto y$, where $x$ is a cyclic voltammogram
(CV) and $y \in \mathbb{R}$ the diffusion coefficient $D$.
The function $f$ hence describes the relationship between
experimentally acquired data (CVs) and an unknown
physical property ($D$) of the electrochemical species.
The following Sections "Support vector regression" and
"Gaussian processes" introduce two different techniques
for estimating the function $f$.

Support vector regression 

Support vector regression (SVR) [27] is a method to estimate
$f(x) \mapsto y$, given a set of data points $(x_i, y_i)$, $i = 1, \ldots, m$.
In the application at hand each data point
$(x_i, y_i) \in \mathbb{R}^n \times \mathbb{R}$ consists of a complete CV and the
respective diffusion coefficient $D$. To introduce the SVR
algorithm, we first consider estimation of linear functions
$f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^n$ denotes the weight
vector and $b \in \mathbb{R}$ the bias term, or offset. For simple linear
regression the parameters $w$ and $b$ are determined by
minimizing the quadratic loss $l_2(f(x_i) - y_i) = (f(x_i) - y_i)^2$
(Figure 1A) across all of the data points. In other words,
one solves the optimization problem (1).

$$\min_{w,b} \sum_{i=1}^{m} (f(x_i) - y_i)^2 \tag{1}$$


In Equation 1, the sum of all $(f(x_i) - y_i)^2$ is minimized
with respect to the weight vector $w$ and offset $b$. After
finding $w$ and $b$, diffusion coefficients are estimated for
previously unseen cyclic voltammograms by evaluating $f$.
In general, the function $f$ relating voltammograms and
diffusion coefficients will not be linear, and we describe the
extension to estimating nonlinear functions later in this
section.
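As an illustration of Equation 1, the following minimal sketch fits $w$ and $b$ by ordinary least squares and evaluates $f$ on the training inputs. It uses NumPy and synthetic data; the function names and the data are assumptions made for this example and are not part of the original MATLAB implementation.

```python
import numpy as np


def fit_linear_least_squares(X, y):
    """Return (w, b) minimizing sum_i (<w, x_i> + b - y_i)^2 (Equation 1)."""
    # Append a column of ones so that the offset b is estimated together with w.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef[:-1], coef[-1]


def predict_linear(X, w, b):
    """Evaluate f(x) = <w, x> + b for every row of X."""
    return X @ w + b


# Synthetic stand-in data (hypothetical): 50 inputs with 3 features each.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
w_true = np.array([0.5, -1.0, 2.0])
y_train = X_train @ w_true + 0.3

w_hat, b_hat = fit_linear_least_squares(X_train, y_train)
y_pred = predict_linear(X_train, w_hat, b_hat)
```

For real voltammograms each row of `X_train` would be one (possibly dimension-reduced) CV and `y_train` the corresponding diffusion coefficients.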

Usually, one is interested in a high prediction accuracy
on data not available during the optimization process, that
is, one wants a function that generalizes well beyond the
given set of training data points. To improve the
generalization performance of the estimated function the space
of solutions for $w$ is restricted by minimizing $\|w\|^2$ in
addition to the squared loss (Equation 2),

$$\min_{w,b} \|w\|^2 + C \sum_{i=1}^{m} (f(x_i) - y_i)^2, \tag{2}$$


where the parameter $C$ controls the complexity of the
solution. Large values of $C$ lead to a smaller error on the
training data points at the expense of a complex function,
while small values of $C$ result in simple (flat) linear
functions at the expense of larger training errors. The
ridge regression [39] problem in Equation 2 can be
transformed into the SVR optimization problem by replacing
the quadratic loss with the $\varepsilon$-insensitive linear loss,
$l_\varepsilon(f(x_i) - y_i) = \max\{0, |f(x_i) - y_i| - \varepsilon\}$, which is shown in
Figure 1B:

$$\begin{aligned}
\min_{w,b} \quad & \|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*) \\
\text{subject to:} \quad & f(x_i) - y_i \le \varepsilon + \xi_i \\
& y_i - f(x_i) \le \varepsilon + \xi_i^* \\
& \xi_i, \xi_i^* \ge 0
\end{aligned} \tag{3}$$

From Equation 3 it is clear that only data points with
$|f(x_i) - y_i| > \varepsilon$ contribute to the solution, since
otherwise the slack variables are zero. The choice of the
$\varepsilon$-insensitive loss function hence induces a sparse
solution that only depends on data points with non-zero loss,
which are called 'support vectors' [27]. In practice the
$\varepsilon$-zone of the loss makes the function estimation more
robust against measurement noise in the target values,
and the $\varepsilon$ parameter is set to match the level of noise
in the target values, if known. The automatic choice of the
parameters $C$ and $\varepsilon$ will be explained later. Robustness of
$f$ with respect to outliers in the target values is achieved
by the linear part of the loss function (Figure 1B). Since
outliers are not an issue for the envisaged estimation of
diffusion coefficients, where the training data set consists
of simulated cyclic voltammograms, the loss function is
replaced by the $\varepsilon$-insensitive quadratic loss,
$l_\varepsilon^2(f(x_i) - y_i) = \max\{0, (f(x_i) - y_i)^2 - \varepsilon\}$, shown in Figure 1C. This
exchange of the loss function allows the SVR
optimization problem to be solved by the Newton algorithm for
linear [40] and nonlinear function estimation [41,42]. For the
$\varepsilon$-insensitive quadratic loss the optimization problem in
Equation 3 transforms into the unconstrained
optimization problem (4):

$$\min_{w,b} \|w\|^2 + C \sum_{i=1}^{m} l_\varepsilon^2(\langle w, x_i \rangle + b - y_i). \tag{4}$$
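The three loss functions discussed above, and the unconstrained linear objective of Equation 4, can be written down in a few lines. This is a sketch in NumPy with hypothetical function names; the quadratic $\varepsilon$-insensitive loss follows the form stated in the text, $\max\{0, r^2 - \varepsilon\}$ for a residual $r = f(x_i) - y_i$.

```python
import numpy as np


def quadratic_loss(r):
    """Quadratic loss r^2 (Figure 1A)."""
    return r ** 2


def eps_insensitive_linear_loss(r, eps):
    """Linear eps-insensitive loss max{0, |r| - eps} (Figure 1B)."""
    return np.maximum(0.0, np.abs(r) - eps)


def eps_insensitive_quadratic_loss(r, eps):
    """Quadratic eps-insensitive loss max{0, r^2 - eps}, as stated in the text (Figure 1C)."""
    return np.maximum(0.0, r ** 2 - eps)


def linear_svr_objective(w, b, X, y, C, eps):
    """Value of the unconstrained objective in Equation 4."""
    residuals = X @ w + b - y
    return w @ w + C * np.sum(eps_insensitive_quadratic_loss(residuals, eps))


# Residuals inside the eps-zone produce zero loss for both insensitive losses.
residuals = np.array([-2.0, -0.05, 0.0, 0.05, 2.0])
linear_values = eps_insensitive_linear_loss(residuals, eps=0.1)
quadratic_values = eps_insensitive_quadratic_loss(residuals, eps=0.1)
```

Only the two residuals outside the $\varepsilon$-zone contribute, which is the source of the sparsity mentioned above.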

Linear functions might not provide the necessary
flexibility for the estimation of diffusion coefficients from
experimental data. To extend SVR to nonlinear function
estimation one assumes that the function $f(x)$ resides in a
Hilbert space $\mathcal{H}$. Under this assumption the minimization
of $\|w\|^2$ is replaced by the minimization of the squared
function norm $\|f\|_{\mathcal{H}}^2$ in Hilbert space and Equation 4
can be reformulated as:

$$\min_{f \in \mathcal{H}} \|f\|_{\mathcal{H}}^2 + C \sum_{i=1}^{m} l_\varepsilon^2(f(x_i) - y_i). \tag{5}$$

In this form the optimization problem (5) is not solvable,
since $f$ is unknown. Yet, according to the representer
theorem [43], the evaluation of $f$ at point $x_i$ is given by a
linear combination of kernel functions:

$$f(x_i) = \sum_{j=1}^{m} \beta_j k(x_i, x_j). \tag{6}$$

This permits the minimization of $l_\varepsilon^2(f(x_i) - y_i)$ in terms
of the coefficients $\beta$ instead of $f$. Further, Equation 6
allows one to rewrite the squared norm of the function:

$$\|f\|_{\mathcal{H}}^2 = \langle f, f \rangle_{\mathcal{H}} = \sum_{i,j} \beta_i \beta_j \langle k(x_i, \cdot), k(\cdot, x_j) \rangle_{\mathcal{H}}.$$

In the final step the dot product between kernel functions
can be expressed as $\langle k(x_i, \cdot), k(\cdot, x_j) \rangle_{\mathcal{H}} = k(x_i, x_j)$, where
we exploited the reproducing property [44] of the Hilbert
space given by $f(x_i) = \langle f, k(\cdot, x_i) \rangle_{\mathcal{H}}$. By combining these
reformulations, the nonlinear SVR optimization problem
is:

$$\min_{\beta, b} \beta^T K \beta + C \sum_{i=1}^{m} l_\varepsilon^2(K_i \beta + b - y_i), \tag{7}$$

where $K_{ij} = k(x_i, x_j)$ is the kernel matrix and $K_i$ denotes its
$i$-th row. Similar to the linear case, the objective function
in (7) contains a regularization term, $\|f\|_{\mathcal{H}}^2 = \beta^T K \beta$, and
a loss function term, $l_\varepsilon^2(K_i \beta + b - y_i)$. As discussed above
for the linear case, the parameter $C$ controls the complexity of
the estimated function.
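To make the structure of Equation 7 concrete, the sketch below builds an RBF kernel matrix and evaluates the objective for given coefficients $\beta$ and offset $b$. It only evaluates the objective; it does not implement the Newton-type solver used in the paper. Function names and the toy data are assumptions for illustration.

```python
import numpy as np


def rbf_kernel_matrix(X, gamma):
    """Kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kernel_svr_objective(beta, b, K, y, C, eps):
    """Regularization term beta^T K beta plus C times the summed loss (Equation 7)."""
    residuals = K @ beta + b - y
    loss = np.maximum(0.0, residuals ** 2 - eps)
    return beta @ K @ beta + C * np.sum(loss)


# Toy data (hypothetical): three one-dimensional inputs.
X_toy = np.array([[0.0], [1.0], [2.0]])
y_toy = np.array([0.0, 1.0, 2.0])
K_toy = rbf_kernel_matrix(X_toy, gamma=1.0)

# With beta = 0 the function is the constant b, so only the loss term remains.
objective_value = kernel_svr_objective(np.zeros(3), 1.0, K_toy, y_toy, C=10.0, eps=0.1)
```

A solver would minimize this value over `beta` and `b`; predictions for new inputs are then obtained from Equation 6 with the kernel evaluated between new and training points.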

Table 1 lists the two kernel functions which are subsequently
used to estimate diffusion coefficients from cyclic voltammograms.

Table 1 Kernel functions

Type      Function
Linear    $k(x_i, x_j) = \langle x_i, x_j \rangle$
RBF       $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$

The parameter $\gamma$ for the radial
basis function (RBF) kernel, together with the regularization
parameter $C$ and the loss function parameter $\varepsilon$,
was chosen automatically during SVR function estimation
by minimizing a bound on the leave-one-out error.
The leave-one-out error is the average of the errors on single
data points that were removed from the set before the
function estimation. It is an almost unbiased estimate of
the expected error on unseen data, but requires the function
to be estimated $m$ times. To avoid this, we minimized
a bound on the leave-one-out error with a quasi-Newton
algorithm [45,46]. The described algorithms were
implemented in MATLAB®.
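For reference, the exact leave-one-out error described above (not the bound that was minimized in this work) can be sketched as follows. The example uses a simple ridge regression as the function estimator purely as a stand-in; `fit_ridge`, `predict_ridge`, and the synthetic data are assumptions for illustration. The loop makes the cost of $m$ separate function estimations explicit.

```python
import numpy as np


def leave_one_out_error(X, y, fit, predict):
    """Average squared error over data points held out one at a time."""
    m = X.shape[0]
    errors = np.empty(m)
    for i in range(m):
        keep = np.arange(m) != i          # all points except the i-th
        model = fit(X[keep], y[keep])     # one function estimation per point
        errors[i] = (predict(model, X[i:i + 1])[0] - y[i]) ** 2
    return errors.mean()


def fit_ridge(X, y, lam=1e-3):
    """Stand-in estimator: ridge regression (the offset is regularized too, for brevity)."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    A = X_aug.T @ X_aug + lam * np.eye(X_aug.shape[1])
    return np.linalg.solve(A, X_aug.T @ y)


def predict_ridge(coef, X):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ coef


# Synthetic, exactly linear data (hypothetical).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 2))
y_demo = X_demo @ np.array([1.0, -2.0]) + 0.5
loo = leave_one_out_error(X_demo, y_demo, fit_ridge, predict_ridge)
```

In a model selection loop this quantity would be recomputed for every candidate setting of the hyperparameters, which is why a cheaper bound is attractive.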

Gaussian processes 

A Gaussian process is defined as a collection of random 
variables, any finite number of which have consistent joint 
Gaussian distributions [28]. A Gaussian process general- 
izes the concept of the Gaussian distribution over vectors 
to a distribution over functions and is fully defined by its 
mean function $m(x)$ and covariance function $k(x, x')$. In
order to draw samples from a Gaussian process one first
evaluates the mean and covariance function at a finite set
of data points to obtain a mean vector $\mu \in \mathbb{R}^m$ with
$\mu_i = m(x_i)$ and a covariance matrix $\Sigma \in \mathbb{R}^{m \times m}$ with
$\Sigma_{ij} = k(x_i, x_j)$, and subsequently draws a vector of function
values $f \sim \mathcal{N}(\mu, \Sigma)$, where $\mathcal{N}(\mu, \Sigma)$ denotes a
multi-dimensional Gaussian distribution with mean vector $\mu$ and
covariance matrix $\Sigma$. Specifying the mean and covariance
function thus reflects prior knowledge about the properties,
for example the smoothness, of the estimated function.

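Finding the function values $f_*$ for previously unseen
test data points is possible by considering the joint
distribution:

$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu \\ \mu_* \end{bmatrix}, \begin{bmatrix} \Sigma & \Sigma_* \\ \Sigma_*^T & \Sigma_{**} \end{bmatrix} \right), \tag{8}$$

where $\mu_*$ is the vector of test means, $\Sigma_*$ the covariance
for training-test data points and $\Sigma_{**}$ the covariance for
test data points. Since the joint distribution is Gaussian,
the posterior distribution of $f_*$, given the known function
values at the training data points, is again Gaussian:

$$f_* \mid f \sim \mathcal{N}\left( \mu_* + \Sigma_*^T \Sigma^{-1} (f - \mu),\; \Sigma_{**} - \Sigma_*^T \Sigma^{-1} \Sigma_* \right). \tag{9}$$

Thus calculating the distribution of $f_*$ just requires
evaluation of the mean vectors and covariance matrices, and
the inversion of the training set covariance matrix by a
Cholesky decomposition [47].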
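The posterior computation can be sketched in NumPy as follows. This is a minimal illustration, assuming a zero mean function, a squared exponential covariance, and a small diagonal jitter for numerical stability; none of these choices, nor the function names, are taken from the GPML toolbox used in the paper.

```python
import numpy as np


def sq_exp_cov(A, B, sigma2, length):
    """Squared exponential covariance between the rows of A and the rows of B."""
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return sigma2 * np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * length ** 2))


def gp_posterior(X, f, X_star, sigma2=1.0, length=1.0, jitter=1e-10):
    """Posterior mean and covariance of f_* given f, assuming zero prior mean."""
    S = sq_exp_cov(X, X, sigma2, length) + jitter * np.eye(X.shape[0])
    S_star = sq_exp_cov(X, X_star, sigma2, length)        # training-test covariance
    S_star_star = sq_exp_cov(X_star, X_star, sigma2, length)
    L = np.linalg.cholesky(S)                             # S = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))   # S^{-1} f without explicit inverse
    mean = S_star.T @ alpha
    V = np.linalg.solve(L, S_star)
    cov = S_star_star - V.T @ V
    return mean, cov


# Toy example (hypothetical): predicting at the training inputs recovers the targets.
X_toy = np.array([[0.0], [1.0], [2.0]])
f_toy = np.sin(X_toy[:, 0])
post_mean, post_cov = gp_posterior(X_toy, f_toy, X_toy)
```

The two triangular solves replace the explicit matrix inverse, which is the usual reason for using the Cholesky factor here.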

The choice of a particular mean and covariance function
corresponds to the training of a Gaussian process. In the
absence of precise prior information about the functional
relationship underlying the data it is best to parameterize
the mean and covariance function and estimate the
parameters from the available data. Usually the training
is restricted to identifying a suitable covariance function,
after subtracting the empirical mean from the regression
targets $y_i$. Table 2 lists the covariance functions considered
for the estimation of diffusion coefficients. An additional
term $\sigma_n^2 \delta_{ij}$ is added to each covariance function, with $\delta_{ij}$
being Kronecker's delta, in order to model Gaussian noise
in the regression targets.

The parameters $\theta$ of the covariance function, e.g.
$\theta = (\sigma^2, \ell)$ for the squared exponential covariance function,
are determined by maximizing the probability of the
data given the parameters. Since the data distribution is
assumed to be Gaussian, the logarithm of this probability
is [28]:

$$\begin{aligned}
L &= \log p(y \mid x, \theta) \\
&= -\frac{1}{2} \log |\Sigma| - \frac{1}{2} (y - \mu)^T \Sigma^{-1} (y - \mu) - \frac{m}{2} \log(2\pi).
\end{aligned} \tag{10}$$

After calculating the partial derivative of Equation 10
with respect to $\theta$ one can use a conjugate gradients
algorithm to optimize the parameters. It should be noted that
the first term in the objective function (10) regularizes
the solution, while the second term measures the quality
of the data fit, and the third term is a constant
independent of the data. In contrast to the SVR algorithm
(Section "Support vector regression") there is no
regularization parameter $C$ that needs to be set, since there is an
implicit trade-off between function complexity and data
fit. For the Gaussian process regression we used the freely
available GPML toolbox for MATLAB® [28].
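The three terms of Equation 10 can be evaluated directly once the covariance matrix is available. The following sketch uses a Cholesky factor to obtain both the log-determinant and the quadratic form; it is a NumPy illustration with a hypothetical function name, not the GPML toolbox routine.

```python
import numpy as np


def log_marginal_likelihood(y, mu, Sigma):
    """Evaluate Equation 10 for targets y, mean vector mu and covariance matrix Sigma."""
    m = y.shape[0]
    L = np.linalg.cholesky(Sigma)                        # Sigma = L L^T
    r = y - mu
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))  # Sigma^{-1} (y - mu)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log |Sigma|
    complexity_term = -0.5 * log_det
    data_fit_term = -0.5 * r @ alpha
    constant_term = -0.5 * m * np.log(2.0 * np.pi)
    return complexity_term + data_fit_term + constant_term


# Sanity check with a 2 x 2 identity covariance and y equal to the mean.
value = log_marginal_likelihood(np.zeros(2), np.zeros(2), np.eye(2))
```

In a hyperparameter search, `Sigma` would be rebuilt from the covariance function (plus the noise term) for each candidate $\theta$ and this value maximized.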

Table 2 Covariance functions

Type            Function
Linear          $k(x, x') = \sigma^2 (1 + \langle x, x' \rangle)$
Squared exp.    $k(x, x') = \sigma^2 \exp(-\|x - x'\|^2 / 2\ell^2)$

Nicholson-Shain equation approach

The analysis of voltammetric measurements relates a system
parameter [11], the diffusion coefficient $D$, and the
experimental variables, such as the initial concentration $c_0$, the
electrode area $A$, scan rate $v$, and temperature $T$, as well as
other parameters (here: number of transferred electrons
$n$) of the electrochemical system to the electric current
$i$ flowing through the electrode. For the dimensionless
current function $\chi$ the relationship (11) holds [1],

$$\sqrt{\pi}\chi = \frac{i}{n F A c_0 \sqrt{D \frac{nF}{RT} v}}, \tag{11}$$

with Faraday constant $F = 96485.339$ C mol$^{-1}$, and gas
constant $R = 8.314472$ J mol$^{-1}$ K$^{-1}$. If the reaction
under investigation is a simple reversible electron transfer,
the dimensionless current at the peak approaches a value
[1], i.e. $\sqrt{\pi}\chi_p = 0.4463$, independent of any parameter
describing the electrochemical system. During voltammetric
experiments the current is measured, while $v$, $T$, $A$,
and $c_0$ are known or under control of the experimenter.
Therefore, the diffusion coefficient of the electrochemical
species can be determined by solving Equation 11, in
particular at the voltammetric peak:

$$D = \frac{(i_p^{for})^2 R T}{(0.4463\, n c_0 A F)^2\, n F v}, \tag{12}$$

where the current of the forward peak $i_p^{for}$ (Figure 2) is
extracted from the experimental cyclic voltammogram.
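The peak-current relation can be turned into a short calculation. The sketch below inverts it for $D$ and checks the result by a round trip through the forward relation. It assumes consistent units (current in A, area in cm$^2$, concentration in mol cm$^{-3}$, scan rate in V s$^{-1}$, so that $D$ is in cm$^2$ s$^{-1}$); the function and parameter names are hypothetical.

```python
import math

F = 96485.339   # Faraday constant in C/mol
R = 8.314472    # gas constant in J/(mol K)


def peak_current(D, n, A, c0, v, T, sqrt_pi_chi_p=0.4463):
    """Forward peak current for a simple reversible electron transfer."""
    return sqrt_pi_chi_p * n * F * A * c0 * math.sqrt(D * n * F * v / (R * T))


def diffusion_coefficient(i_p, n, A, c0, v, T, sqrt_pi_chi_p=0.4463):
    """Diffusion coefficient from the forward peak current (peak-current relation solved for D)."""
    return (i_p / (sqrt_pi_chi_p * n * c0 * A * F)) ** 2 * R * T / (n * F * v)


# Example values in the units assumed above: 0.4 mmol/L equals 0.4e-6 mol/cm^3.
params = dict(n=1, A=0.064, c0=0.4e-6, v=0.2, T=293.15)
i_p = peak_current(1.0e-5, **params)
D_recovered = diffusion_coefficient(i_p, **params)
```

The fixed value 0.4463 is exactly the assumption that breaks down for the kinetically complicated mechanisms discussed next; passing a different `sqrt_pi_chi_p` shows how strongly the computed $D$ depends on it, since it enters quadratically.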

Although diffusion coefficients can be calculated from
Equation 12 given an experimental cyclic voltammogram,
the assumption of a known dimensionless current
$\sqrt{\pi}\chi_p$ is violated for electrode reactions deviating
from the simple diffusion-controlled one-electron transfer.
For more complex cases, $\sqrt{\pi}\chi$ depends on various
variables [1], including rate constants that are often
unknown. Examples are the E_qr (quasi-reversible
electron transfer), the EC (reversible electron transfer
with irreversible chemical follow-up reaction), and the
E_qrC (quasi-reversible electron transfer with irreversible
chemical follow-up reaction) mechanisms, described in
Section "Results and discussion". Then, the dimensionless peak
current $\sqrt{\pi}\chi_p$ changes in a nonlinear fashion depending on
the kinetic rate constants of the electron transfer or the
follow-up reaction. For the case of the EC mechanism, the
dependence on the dimensionless follow-up rate constant
$\kappa_1 = k_1/a$ (with $k_1$ being the first order rate constant,
and $a = nFv/RT$) is shown in Figure 3. In this case
calculation of the diffusion coefficient by the Nicholson-Shain
equation is only possible if the rate constant of the EC
mechanism has a very small value of $\log(\kappa_1) < -3$. If the
exact value of the rate constant is unknown, it might still
be possible to estimate the diffusion coefficient by
regression algorithms such as SVR (Section "Support vector
regression"), or GPR (Section "Gaussian processes").

Figure 2 Example cyclic voltammogram. The forward peak, half
peak, and reverse peak potentials ($E_p^{for}$, $E_{p/2}$, $E_p^{rev}$), and currents
($i_p^{for}$, $i_{p/2}$, $i_p^{rev}$), which are used to calculate the manually
extracted features, are indicated.

Simulations 

Voltammetric measurements were simulated by the 
CVSIM program included in the EASIEST software pack- 
age [48]. Common parameters used in all simulations 
are listed in Table 3 while the remaining parameter val- 
ues of the electrochemical system are given separately in 
Section "Estimation from simulated data" for each ana- 
lyzed mechanism. In all simulation runs the CVSIM pro- 
gram was configured to use the MetanI integrator and 
the technique of spline collocation [49] with 10 colloca- 
tion points. 

Figure 3 Variation of the dimensionless peak current $\sqrt{\pi}\chi_p$ with
the dimensionless rate constant $\kappa_1$ for the EC reaction
mechanism. The dimensionless peak current $\sqrt{\pi}\chi_p$ is constant only
for very small ($\log(\kappa_1) < -3$) and very large ($\log(\kappa_1) > 4$) values of
the rate constant. In the former case, the limiting value of 0.4463 is
approached; for an explanation of the black bar on the abscissa, see
text, Section "EC mechanism – dependence on $k_1$".

Table 3 Common simulation parameters for all mechanisms

Parameter and unit                       Value
Scan rate $v$ (V s$^{-1}$)               0.2
Potential step size $\Delta E$ (mV)      1
Initial concentration $c_0$ (mmol/L)     0.4
Temperature $T$ (K)                      293.15
Electrode area $A$ (cm$^2$)              0.064
Symmetry factor $\alpha$                 0.5

Fitting of simulation parameters

Fitting simulation parameters by globally minimizing the
sum of squared errors between experimental and simulated
cyclic voltammograms was used to identify the
formal potential $E^0$, the heterogeneous electron transfer
rate constant $k_s$, and $D$ for the E_qr and E_qrC mechanisms,
as well as the homogeneous chemical rate constant $k_1$ for
the E_qrC mechanism from the experimental cyclic
voltammograms. The resulting $D$ were used as approximations
to the real value. To achieve a homogeneous fit across all
experimental voltammograms and avoid large deviations
for small-amplitude voltammograms, the currents of
simulated and experimental voltammograms were scaled to
the interval $[-1, 1]$ prior to computing the objective function.
The minimization of the sum of squared errors
was carried out by an interior point algorithm [50]
as implemented in the KNITRO software library [51].
Values for the diffusion coefficients obtained by this approach
served as a reference for judging the accuracy of
coefficients estimated by SVR and GPR for the experimental
cyclic voltammograms of the organometallic complexes
(Section "Estimations from experimental data").
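The scaled sum-of-squared-errors objective can be sketched as below. The text does not state how the scaling to $[-1, 1]$ was carried out; the min-max scaling applied separately to each voltammogram is an assumption of this example, as are the function names. The sketch also assumes that the current trace is not constant, since the scaling divides by its range.

```python
import numpy as np


def scale_to_unit_interval(current):
    """Min-max scale a current trace to the interval [-1, 1] (assumed scaling)."""
    current = np.asarray(current, dtype=float)
    lo, hi = current.min(), current.max()
    return 2.0 * (current - lo) / (hi - lo) - 1.0


def sse_objective(i_experimental, i_simulated):
    """Sum of squared errors between scaled experimental and simulated currents."""
    diff = scale_to_unit_interval(i_experimental) - scale_to_unit_interval(i_simulated)
    return float(np.sum(diff ** 2))


# Toy traces (hypothetical): the second trace has five times the amplitude of the first.
trace = np.array([0.0, 1.0, 3.0, 2.0])
scaled = scale_to_unit_interval(trace)
mismatch = sse_objective(trace, 5.0 * trace)
```

Because of the scaling, traces that differ only in amplitude give a zero objective, which reflects the stated aim of not letting small-amplitude voltammograms dominate or vanish in the fit.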

Results and discussion 

In a first step (Section "Estimation from simulated data") 
the approach based on the Nicholson-Shain equation 
and the regression algorithms SVR and GPR were used 
to estimate diffusion coefficients for simulated cyclic 
voltammograms with known diffusion coefficients. This 
allowed us to compare the performance of the different 
methods in terms of accuracy of the estimated diffu- 
sion coefficients. Furthermore, the simulated data helped 
to analyze the dependence of accuracy on the rate con- 
stants of the underlying reaction mechanism. In a second 
step (Section "Estimations from experimental data") the 
regression algorithms, trained on the simulated data, were 
used to estimate D for experimental cyclic voltammo- 
grams with unknown diffusion coefficients. 

Estimation from simulated data 

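Cyclic voltammograms were simulated as described
in Section "Simulations" for the following three reaction
mechanisms with the respective model parameters
(Table 4):

$$\text{EC:} \quad A \overset{\pm e^-}{\rightleftharpoons} B \overset{k_1}{\longrightarrow} C \qquad E^0, k_1 \tag{13}$$

$$\text{E}_{qr}\text{:} \quad A \overset{\pm e^-}{\rightleftharpoons} B \qquad E^0, k_s, \alpha \tag{14}$$

$$\text{E}_{qr}\text{C:} \quad A \overset{\pm e^-}{\rightleftharpoons} B \overset{k_1}{\longrightarrow} C \qquad E^0, k_s, \alpha, k_1 \tag{15}$$
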

Table 4 Simulation parameters for the EC, E_qr, and E_qrC mechanisms

Parameter and unit          Value

EC: A ⇌ B → C
$k_1$ (s$^{-1}$)            0.001, 0.01, 0.1, 1, 10, 100, 1000
$D$ (cm$^2$ s$^{-1}$)       1·10$^{-6}$, 1.5·10$^{-6}$, ..., 5·10$^{-5}$, 5.05·10$^{-5}$
$E^0$ (V)                   0.3
$E_{start}$ (V)             0
$E_{rev}$ (V)               0.7

E_qr: A ⇌ B
$k_s$ (cm s$^{-1}$)         0.001, 0.005, 0.01, 0.02, ..., 0.1, 0.5, 1
$D$ (cm$^2$ s$^{-1}$)       as EC
$E^0$ (V)                   0.2108
$E_{start}$ (V)             0
$E_{rev}$ (V)               0.5

E_qrC: A ⇌ B → C
$k_1$ (s$^{-1}$)            as EC
$k_s$ (cm s$^{-1}$)         as E_qr
$D$ (cm$^2$ s$^{-1}$)       as EC
$E^0$ (V)                   0.2775
$E_{start}$ (V)             0
$E_{rev}$ (V)               0.6


For each mechanism one combination of diffusion coefficient
and rate constant(s) was used per simulation run
(Table 4). The resulting simulated data set comprised a
total of 700 simulated voltammograms for the EC mechanism,
1400 for the E_qr mechanism, and 2800 for the E_qrC
mechanism. This full data set was randomly partitioned
into training and test data sets, each containing 50% of the
simulated cyclic voltammograms. Only the training data
set was used for the function estimation by SVR and GPR,
while the performance of each algorithm was assessed on
the test data set.
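A random 50/50 partition of this kind can be sketched as follows. The seed and the function name are assumptions made for reproducibility of the example; the paper does not specify how the random partition was generated.

```python
import numpy as np


def split_half(n_samples, seed=0):
    """Return disjoint index arrays for a random 50/50 training/test partition."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(n_samples)
    half = n_samples // 2
    return permutation[:half], permutation[half:]


# Example: the 700 simulated voltammograms of the EC mechanism.
train_idx, test_idx = split_half(700)
```

The index arrays would then be used to select rows of the voltammogram matrix and the corresponding diffusion coefficients, so that the test set is never seen during function estimation.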

First we compared the accuracy of the diffusion coefficients
calculated by the approach based on the Nicholson-Shain
equation, SVR with linear kernel, SVR with RBF
kernel (Table 1), GPR with linear covariance function,
and GPR with squared exponential covariance function
(Table 2) for each of the three reaction mechanisms
(Figure 4). For the simulated data the true value of the
diffusion coefficients is known and can be used as a
reference. Prior to applying the SVR and GPR algorithms we
reduced the dimensionality of the simulated CVs from
1401 (each dimension corresponds to one current value
of the CV) to 5 by projecting the data onto the subspace
spanned by the 5 dominant principal components. This
preprocessing by principal component analysis (PCA)
explained 99% of the variance in the EC mechanism data,
and 99% and 98% of the variance in the E_qr and E_qrC mechanism
data, respectively.

Figure 4 Distributions of absolute errors on a logarithmic scale
for estimated diffusion coefficients in cm$^2$ s$^{-1}$ on the test data
sets for simulations for EC, E_qr, and E_qrC mechanisms. Black
horizontal bars indicate the mean of the error distributions. The SVR
and GPR algorithms used PCA preprocessing.
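The PCA preprocessing described above can be sketched with a singular value decomposition. The example below projects 1401-dimensional inputs onto the 5 dominant principal components and reports the explained variance; the synthetic low-rank data and the function names are assumptions for illustration only.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the data mean, the leading principal directions and their explained variance ratio."""
    mean = X.mean(axis=0)
    _, singular_values, Vt = np.linalg.svd(X - mean, full_matrices=False)
    variance = singular_values ** 2
    explained = variance[:n_components].sum() / variance.sum()
    return mean, Vt[:n_components], explained


def pca_transform(X, mean, components):
    """Project (new) data onto the principal directions found on the training data."""
    return (X - mean) @ components.T


# Synthetic stand-in for simulated CVs (hypothetical): 40 signals with 1401 samples each,
# constructed to lie in a 3-dimensional subspace.
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 1401))
mean_cv, components_cv, explained_cv = pca_fit(X_cv, n_components=5)
X_reduced = pca_transform(X_cv, mean_cv, components_cv)
```

The mean and components should be estimated on the training set only and then reused for test or experimental voltammograms, which is why fitting and transforming are kept as separate functions.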

In the Nicholson-Shain Equation 12 the diffusion coefficient
is a quadratic function of the forward peak current
$i_p^{for}$. It is therefore not surprising that the nonlinear
functions estimated by SVR with RBF kernel and GPR with the
squared exponential covariance function are better suited
to describe the relationship between cyclic voltammogram
and diffusion coefficient for all investigated mechanisms.
There is a significant difference between the
means of the error distributions of SVR with linear/RBF
kernel, and GPR with linear/squared exponential covariance
function, as shown in Figure 4. In addition, the
nonlinear functions estimated by SVR and GPR consistently
yield lower errors on average than the Nicholson-Shain
equation approach for all the reaction mechanisms.
Note that the broad range of errors produced by the
Nicholson-Shain equation based approach is not surprising:
this method assumes a constant dimensionless peak current
$\sqrt{\pi}\chi_p$, whereas the value is not constant
across the test voltammograms (Figure 3).

After finding an appropriate kernel (RBF) and covariance
function (squared exponential) for the regression
algorithms, we analyzed the influence of different
preprocessing methods on the estimated diffusion coefficients
(Figure 5). For the downsampling method the number
of dimensions in each simulated cyclic voltammogram
was reduced by a factor of 20, i.e. retaining only every
20th sample, while preprocessing by PCA worked as
described above. The manual preprocessing method used
the seven features derived from the potentials and currents
of the cyclic voltammogram shown in Figure 2,
which were chosen as those being most prominent and
commonly used for analysis. These manually extracted
features include the forward peak, half peak, and reverse
peak potentials ($E_p^{for}$, $E_{p/2}$, $E_p^{rev}$), the difference between
forward and reverse peak potential $E_p^{for} - E_p^{rev}$, the
forward peak current $i_p^{for}$, and the ratio between forward and
reverse peak current. Note that this is not the
peak current ratio as defined by Nicholson [52].

Figure 5 Distributions of absolute errors on a logarithmic scale
for diffusion coefficients in cm$^2$ s$^{-1}$ estimated on the test data
sets for simulated mechanisms EC, E_qr, and E_qrC. Black horizontal
bars indicate the mean of the error distributions.

Figure 6 Mean of the absolute error, on a logarithmic scale, for
diffusion coefficients determined by SVR with RBF kernel, GPR
with squared exponential covariance function, and the
Nicholson-Shain equation approach for the EC mechanism
depending on the rate constant $k_1$. Shading around curves
indicates 95% confidence intervals for the mean. The dotted line
indicates the spacing used for the diffusion coefficients in the
simulated data; PCA preprocessing was used for predicting
coefficients with SVR and GPR.

Figure 7 Mean of the absolute error, on a logarithmic scale, for
diffusion coefficients determined by SVR with RBF kernel, GPR
with squared exponential covariance function and the
Nicholson-Shain equation for the E_qr mechanism depending on
the rate constant $k_s$. Shading around curves represents 95%
confidence intervals for the mean. The dotted line indicates the
spacing used for the diffusion coefficients in the simulated data; PCA
preprocessing was used for predicting coefficients with SVR and GPR.

Bogdan et al. Journal of Cheminformatics 2014, 6:30
http://www.jcheminf.com/content/6/1/30

As shown in Figure 5 the manual preprocessing method
yields the lowest accuracy of the estimated diffusion
coefficients for both regression algorithms and all reaction
mechanisms. This indicates that, albeit being helpful for a
human observer, the manually extracted features discard
too much of the information contained in the full cyclic
voltammogram. The performance differences between
the PCA and downsampling methods are small, yet PCA
works best for the E_qrC mechanism, while there is no
difference between the preprocessing methods on the EC
and E_qr mechanisms in conjunction with the SVR
algorithm. For the GPR algorithm PCA is slightly better for
the EC mechanism, while downsampling is better for the
E_qr mechanism. We used PCA preprocessing for both
regression algorithms when estimating diffusion coefficients
from real data, as it allows one to judge the quality of
the data reduction from the amount of explained
variance.
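For comparison with PCA, the downsampling method (retaining every 20th sample) amounts to a single slicing operation. The sketch below uses NumPy and a hypothetical function name; with 1401 samples per voltammogram it keeps the samples at indices 0, 20, ..., 1400.

```python
import numpy as np


def downsample(cv, factor=20):
    """Keep every `factor`-th sample along the last axis of a CV or a matrix of CVs."""
    return np.asarray(cv)[..., ::factor]


# A single stand-in signal with 1401 samples, and a batch of 10 such signals.
single = downsample(np.arange(1401))
batch = downsample(np.zeros((10, 1401)))
```

Unlike PCA, this reduction needs no fitting step and offers no built-in measure, such as explained variance, for judging how much information was lost.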

EC mechanism – dependence on $k_1$

Figure 6 shows the average absolute error between
estimated and true diffusion coefficient values depending on
the rate constant $k_1$ for the EC mechanism. The dotted
line in Figure 6 marks the spacing used for $D$ in
the simulations and can be considered as the baseline
error of a simple table lookup, e.g. if the diffusion
coefficient is determined from a table listing values of $D$ for
different rate constants $k_1$. Confidence intervals for the
average absolute error at the 95% level were computed
by a bootstrap method with 1000 bootstrap samples [53].
While the accuracy of the diffusion coefficients estimated
by the regression algorithms is virtually independent of
the rate constant value, as indicated by the flat error
curves, the accuracy of diffusion coefficients calculated
with the Nicholson-Shain equation degrades with increasing
$k_1$ and the error increases above the baseline error for
$k_1 > 1$ s$^{-1}$.
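A bootstrap confidence interval for the mean absolute error can be sketched as follows. The text specifies 1000 bootstrap samples and the 95% level; the use of the percentile method, the seed, and the function name are assumptions of this example.

```python
import numpy as np


def bootstrap_mean_ci(abs_errors, n_boot=1000, level=0.95, seed=0):
    """Mean of abs_errors with a percentile bootstrap confidence interval (assumed variant)."""
    rng = np.random.default_rng(seed)
    abs_errors = np.asarray(abs_errors, dtype=float)
    n = abs_errors.size
    # Each row holds the indices of one resample drawn with replacement.
    indices = rng.integers(0, n, size=(n_boot, n))
    boot_means = abs_errors[indices].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(boot_means, [tail, 1.0 - tail])
    return abs_errors.mean(), lower, upper


# Synthetic absolute errors (hypothetical) for one value of the rate constant.
rng_demo = np.random.default_rng(42)
demo_errors = np.abs(rng_demo.normal(scale=1e-8, size=200))
mean_err, ci_low, ci_high = bootstrap_mean_ci(demo_errors)
```

Repeating this for the test voltammograms at each rate constant would give the kind of shaded band around a mean error curve that Figures 6 and 7 describe.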

This behaviour of the results from the Nicholson-Shain
equation based approach is expected due to the
dependence of the dimensionless peak current
$\sqrt{\pi}\,\chi_p$ on the dimensionless rate constant
$\kappa_1$ described in Section "Nicholson-Shain equation
approach". The black bars on the abscissa of Figures 3 and 6
mark the region where the dimensionless peak current does
not deviate significantly from the constant asymptotic value
of 0.4463. It should be noted that the scales on the abscissa
in both Figures 3 and 6 are equivalent apart from a constant
offset since, for n = 1,
$\log(\kappa_1) = \log(k_1/\mathrm{s}^{-1}) - \log(a/\mathrm{s}^{-1})$
and $\log(a/\mathrm{s}^{-1}) \approx 0.9$. The quality of the
diffusion coefficients calculated by the Nicholson-Shain
equation for rate constants in this range
($\log(k_1/\mathrm{s}^{-1}) \in (-\infty, -1]$) is even
better than the coefficient values estimated by the SVR
algorithm with RBF kernel (Figure 4). Since the exact value
of the rate constant is often not known in practice, however,
it seems to be better to resort to one of the regression
algorithms for finding the diffusion coefficient in
general.

Figure 9 Chemical structures of compounds 1, 2a, and 2b for
which data were analyzed in this work.
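The sensitivity of the peak-current approach to the dimensionless peak current can be made concrete with a short sketch. It assumes the widely used reversible-case expression $i_p = 0.4463\, n F A c_0 \sqrt{n F v D / (R T)}$; the exact form used in Section "Nicholson-Shain equation approach" is not reproduced in this part of the text, and the temperature of 298.15 K, the function names, and the reduced value 0.40 are assumptions.

```python
import math

F = 96485.332  # Faraday constant, C/mol
R = 8.314462   # gas constant, J/(mol K)


def peak_current(D, c0, v, area, n=1, T=298.15, chi=0.4463):
    """Peak current in A; D in cm^2/s, c0 in mol/cm^3, v in V/s, area in cm^2."""
    return chi * n * F * area * c0 * math.sqrt(n * F * v * D / (R * T))


def diffusion_coefficient(i_p, c0, v, area, n=1, T=298.15, chi=0.4463):
    """Invert the peak-current expression for D (cm^2/s)."""
    return (i_p / (chi * n * F * area * c0)) ** 2 * R * T / (n * F * v)


# Illustrative conditions: 0.2 mM = 0.2e-6 mol/cm^3, 0.5 V/s, 0.064 cm^2.
D_true = 1.0e-5
conditions = dict(c0=0.2e-6, v=0.5, area=0.064)

# Case 1: the asymptotic value 0.4463 holds, so D is recovered.
i_p = peak_current(D_true, **conditions)
print(f"recovered D = {diffusion_coefficient(i_p, **conditions):.3e} cm^2/s")

# Case 2: the true dimensionless peak current is lower (hypothetical 0.40),
# but the evaluation still assumes 0.4463, so D is underestimated.
i_p_low = peak_current(D_true, chi=0.40, **conditions)
print(f"biased D    = {diffusion_coefficient(i_p_low, **conditions):.3e} cm^2/s")
```

Because D depends on the square of the peak current, a deviation of the dimensionless peak current from 0.4463 enters the estimate quadratically.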



E qr mechanism — dependence on k s 

For the E qr mechanism the error incurred by the SVR and
GPR algorithms is constant for electron transfer rate
constant values $\log(k_s/\mathrm{cm\,s^{-1}}) > -2.5$
(Figure 7). Below this value one can observe a slight
increase in the average absolute error from $10^{-8}$ to
$10^{-7.3}$ for SVR and from $10^{-11}$ to $10^{-10.5}$
for GPR.

The error of the Nicholson-Shain equation approach,
on the other hand, increases from $10^{-7}$ to $10^{-5}$ for
electron transfer rates $\log(k_s/\mathrm{cm\,s^{-1}})$ in the
range [-3, -2] and thus shows a stronger dependence of
diffusion coefficient accuracy on the rate constant. The
absolute error approaches the order of magnitude of the
values of D. Overall, the regression algorithms SVR and GPR
yield a more accurate estimate of the diffusion coefficient
for simulated E qr voltammograms in comparison to the
Nicholson-Shain equation and to table look-up.

E qr C mechanism — dependence on $k_1$ and $k_s$

In contrast to the EC and E qr reaction mechanisms,
the E qr C mechanism is governed by two rate constants
$k_1$ and $k_s$ (Table 4). For the three tested methods the
error surfaces are rather flat and only slightly increase
for $\log(k_s/\mathrm{cm\,s^{-1}})$ between -1.5 and 0
(Figure 8). The largest difference between two points on the
logarithmic error surface is 0.48 for the Nicholson-Shain
equation approach, 0.36 for SVR, and 0.53 for GPR. Notably,
the global error level for the E qr C mechanism is on the same
scale as the error level for the E qr and EC mechanisms
(Nicholson-Shain: [-5.6, -5.1], SVR: [-7.6, -7.3], GPR:
[-11.4, -10.9]), which indicates that the proposed
estimation of diffusion coefficients is extensible to more
complex reaction mechanisms.

Estimations from experimental data 

The estimation of diffusion coefficients was applied to 
three experimental data sets, each containing 80 experi- 
mental cyclic voltammograms. The first data set consisted 
of measurements for iridium complex 1 [22], the second 
and third of those for ruthenium complexes 2a and 2b 
[54,55] (see Figure 9 and Section "Experimental"). The 
reaction mechanisms (E qr C for complex 1, and E qr for 
complexes 2a and 2b) were established earlier [22,54]. 



Table 5 Parameter values yielding the best fit between
simulated and experimental cyclic voltammograms for the
three metal complexes

Parameter        1           2a          2b
E° (V)           0.2767      0.2084      0.1938
k_s (cm s^-1)    0.0232      0.0199      0.0118
k_1 (s^-1)       0.1473      -           -
D (cm^2 s^-1)    1.5846e-5   1.0535e-5   1.0824e-5

Figure 10 Experimental cyclic voltammograms for complexes 1, 2a, 2b (from left to right), indicated by solid lines, for a scan rate of
$0.5\ \mathrm{V\,s^{-1}}$ and initial concentrations of 0.2, 0.4, 0.6, 0.8 mM. Electroactive area: $A = 0.064\ \mathrm{cm}^2$; potential values vs. a Ag/Ag$^+$ reference electrode
[22,54]; the simulated cyclic voltammograms which are the result of the parameter fitting process are indicated by dashed lines.



Since the true value of the diffusion coefficient is
unknown for each of the experimental data sets, we fitted
simulated cyclic voltammograms to the experimental
signals by optimizing the formal potential E°, the rate
constants $k_1$, $k_s$, and the diffusion coefficient D as
described in Section "Fitting of simulation parameters". The
fitted diffusion coefficients serve as a reference point for
comparing the values calculated by the regression algorithms
and the Nicholson-Shain equation approach. Table 5 lists the
parameter values that yield the best fit between simulated
and experimental cyclic voltammograms and Figure 10
gives an impression of the fit quality. The best fit was
obtained for the E qr reaction of complex 2a with an average
absolute error between simulated and experimental
signals of 0.75 µA, followed by the E qr reaction of 2b
(1.09 µA), and the E qr C reaction of 1 (3.23 µA).

Based on the results with simulated data (Section
"Estimation from simulated data") we used SVR with
RBF kernel and GPR with squared exponential covariance
function in conjunction with the PCA preprocessing
method to estimate diffusion coefficients for the
experimental data sets. For complex 1, the training data
consisted of all 2800 simulated cyclic voltammograms created
for the E qr C mechanism (Section "E qr C mechanism —
dependence on $k_1$ and $k_s$"), while 1200 simulated cyclic
voltammograms for the E qr mechanism served as training
data for 2a/2b. In order to have the voltammograms on a
comparable scale the current was normalized by multiplying
the signal with the factor $(c_0\sqrt{v})^{-1}$.

The trained regression algorithms and the approach
based on the Nicholson-Shain equation were then used
to calculate the diffusion coefficient for each of the 80
experimental voltammetric curves. Since the diffusion
coefficient of the electrochemically active species should
be constant across measurements with different scan rates
and initial concentrations, we averaged the 80 calculated
coefficients to arrive at the final estimate. Table 6 lists the
diffusion coefficients determined by parameter fitting, the
Nicholson-Shain equation approach, and the regression
algorithms.

For 1 the diffusion coefficient estimated by GPR is the
best match with respect to the fitted coefficient value.
Although there is only a small difference in the
estimates of SVR and GPR, the best diffusion coefficient
estimates for 2a/2b are provided by SVR. In contrast to
the regression algorithms, the Nicholson-Shain equation
consistently underestimates the diffusion coefficient value
on all data sets.

Table 6 Diffusion coefficients (in $10^{-5}\ \mathrm{cm^2\,s^{-1}}$)
determined by different methods for the experimental cyclic
voltammograms; values marked with *: best matches with respect
to parameter fitting results

Method            1        2a       2b
Parameter-Fit     1.58     1.05     1.08
Nicholson-Shain   1.07     0.84     0.82
SVR               2.32     1.07*    1.09*
GPR               1.54*    1.10     1.13

To further assess the quality of the estimated values we
repeated the simulation of cyclic voltammograms with
the estimated diffusion coefficients and calculated the
discrepancy between simulated and experimental
voltammetric signals (Table 7). In comparison to the parameter
fitting method the average absolute error increases only
slightly for the coefficients estimated by SVR for 2a/2b,
and GPR on all organometallic complexes. The diffusion
coefficients obtained by the Nicholson-Shain equation for
1, 2a, and 2b, and by the SVR algorithm for 1 are of
inferior quality.

Table 7 Average absolute error of currents in µA between
simulated and experimental cyclic voltammograms

Method            1        2a       2b
Parameter-Fit     3.23     0.75     1.09
Nicholson-Shain   3.74     1.41     1.74
SVR               4.66     0.76     1.10
GPR               3.19     0.803    1.14

Figure 11 CPU time in minutes required by the parameter fitting method, and the regression algorithms for the three organometallic
complexes 1, 2a, and 2b. Hatched bars indicate the portion of time required by the SVR and GPR algorithm without the simulations. All
measurements were made on an Intel® Xeon® 5150 processor with 2.66 GHz and 8 GB of main memory.

The parameter fitting approach usually yields reliable 
estimates of the diffusion coefficients in practice, but 
at the expense of long computational times (Figure 11). 
In contrast, the creation of simulated data followed by 
regression algorithm training and estimation of diffusion 
coefficients only takes a small percentage of the parameter 
fitting time (3-20%). If simulated data is already available, 
this percentage is further reduced to 0.01-0.06%, which is 
beneficial if large amounts of experimental data need to 
be analyzed. 
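Two steps of the procedure used in this section, the normalization of the current and the averaging of the per-voltammogram estimates, can be sketched in a few lines. The function names and the toy traces are hypothetical, and the trained regression model that would produce the individual estimates is deliberately left out.

```python
import math
import statistics


def normalize_current(current, c0, v):
    """Scale a current trace by the factor (c0 * sqrt(v))**-1."""
    factor = 1.0 / (c0 * math.sqrt(v))
    return [i * factor for i in current]


def final_estimate(per_curve_estimates):
    """Average the per-voltammogram estimates of D into one value."""
    return statistics.fmean(per_curve_estimates)


# Toy traces whose amplitude scales with c0 * sqrt(v); after normalization
# both collapse onto the same curve and become comparable.
shape = [0.0, 1.0, 2.0, 1.0]
trace_a = [0.2 * math.sqrt(0.5) * s for s in shape]
trace_b = [0.8 * math.sqrt(2.0) * s for s in shape]
print(normalize_current(trace_a, c0=0.2, v=0.5))
print(normalize_current(trace_b, c0=0.8, v=2.0))

# Hypothetical per-curve estimates (cm^2/s) from a trained regressor.
estimates = [1.04e-5, 1.09e-5, 1.06e-5, 1.08e-5]
print(f"final D = {final_estimate(estimates):.3e} cm^2/s")
```

Averaging is justified here only because D is expected to be the same for all scan rates and concentrations; a large spread among the individual estimates would hint at a mismatch between the training data and the experiment.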

Experimental 

Voltammetric signals in each data set in Section
"Estimations from experimental data" were acquired twice
for ten scan rates of 0.02, 0.05, 0.1, 0.2, 0.5, 1.003,
2.007, 5.120, 10.240, and 20.480 $\mathrm{V\,s^{-1}}$, and four
different initial concentrations $c_0$ of 0.2, 0.4, 0.6, 0.8
$\mathrm{mmol\,L^{-1}}$ in a dichloromethane electrolyte with
0.1 M tetra-n-butylammonium hexafluorophosphate as supporting
electrolyte at a Pt electrode (for further experimental details,
see [22,54]). The scanning potential varied between 0 and
0.6 V for 1, and between 0 and 0.5 V for 2a/2b with an
increment of 1 mV in each case.
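The measurement design can be checked by enumerating it: ten scan rates, four concentrations, and two repetitions give the 80 voltammograms per data set. In the sketch below the dictionary keys and the helper `potential_grid` are illustrative names, and counting both end points of the forward sweep is an assumption.

```python
from itertools import product

scan_rates = [0.02, 0.05, 0.1, 0.2, 0.5,
              1.003, 2.007, 5.120, 10.240, 20.480]  # V/s
concentrations = [0.2, 0.4, 0.6, 0.8]               # mmol/L
repeats = 2                                         # each signal acquired twice

# One entry per recorded voltammogram.
runs = [
    {"scan_rate": v, "c0": c0, "repeat": r}
    for v, c0, r in product(scan_rates, concentrations, range(1, repeats + 1))
]
print(len(runs), "voltammograms per data set")


def potential_grid(e_max_mv, step_mv=1):
    """Forward-sweep potentials in V from 0 to e_max_mv, both ends included."""
    return [mv / 1000.0 for mv in range(0, e_max_mv + 1, step_mv)]


print(len(potential_grid(600)), "potential steps for complex 1")
print(len(potential_grid(500)), "potential steps for complexes 2a/2b")
```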

Conclusion 

The results presented in this work show the feasibility of
estimating diffusion coefficients from experimental cyclic
voltammograms by regression algorithms trained on
simulated data. This approach is generic in the sense that it
is not restricted to a particular reaction mechanism and
range of rate constants, as demonstrated by the results
obtained on simulated data for the EC, E qr, and E qr C
mechanisms. On simulated data the accuracy of diffusion
coefficients estimated by SVR with RBF kernel and GPR
with squared exponential covariance function is higher than
that of the Nicholson-Shain equation approach over
a wide range of rate constants. The best preprocessing
method for estimating D with the regression algorithms
turned out to be the principal component projection of
the cyclic voltammograms. Projecting the data to the
subspace spanned by the first five principal components
apparently retains important shape information that is
discarded by the manual extraction of prominent peak
features. This indicates that the commonly used evaluation
of the limited set of human-recognizable features
related to voltammetric peaks might not be optimal for
data evaluation in all cases. For the three experimental
data sets, estimation with GPR yielded diffusion
coefficients that closely matched the values determined by the
classical parameter fitting approach, whereas SVR showed
comparable performance only for 2a/2b. These results
indicate that GPR with a squared exponential covariance
function is better suited than SVR to reliably determine
diffusion coefficients from experimental data. Furthermore,
the GPR-based determination of the diffusion coefficient
requires less computational time than the
parameter fitting approach.
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